INN Hotels Project

Context

A significant number of hotel bookings are called-off due to cancellations or no-shows. The typical reasons for cancellations include change of plans, scheduling conflicts, etc. This is often made easier by the option to do so free of charge or preferably at a low cost which is beneficial to hotel guests but it is a less desirable and possibly revenue-diminishing factor for hotels to deal with. Such losses are particularly high on last-minute cancellations.

The new technologies involving online booking channels have dramatically changed customers’ booking possibilities and behavior. This adds a further dimension to the challenge of how hotels handle cancellations, which are no longer limited to traditional booking and guest characteristics.

The cancellation of bookings impact a hotel on various fronts:

Objective

The increasing number of cancellations calls for a Machine Learning based solution that can help in predicting which booking is likely to be canceled. INN Hotels Group has a chain of hotels in Portugal, they are facing problems with the high number of booking cancellations and have reached out to your firm for data-driven solutions. You as a data scientist have to analyze the data provided to find which factors have a high influence on booking cancellations, build a predictive model that can predict which booking is going to be canceled in advance, and help in formulating profitable policies for cancellations and refunds.

Data Description

The data contains the different attributes of customers' booking details. The detailed data dictionary is given below.

Data Dictionary

Importing necessary libraries and data

Data Overview

Insights:
Dataset has 5 categorical columns.
Dataset has 14 numerical columns.
Insights :
Dataset has no null values
Insights:

Quanititative Data:

Max no.of Adults booked is 2.0 and 50% is 4.0

Max weekend nights are 7.0

no of week nights median is around 2.0 and max is 17.0

Type of meal plan is categorical and has 4 types

Lead time max is 443 min and 50% is around 57 mins.

Arrival year max is 2018 and Arrival month has max booking in Dec. and 50% of them in August.

max no. of previous cancellations are 13 for some bookings.

Booking Status is a categorical column and has caneled and not_canceled status.
Insights:
No duplicate rows are in the dataset provided.
Insights:
hoteldata has 36275 unique rows.

Exploratory Data Analysis (EDA)

Leading Questions:

  1. What are the busiest months in the hotel?
  2. Which market segment do most of the guests come from?
  3. Hotel rates are dynamic and change according to demand and customer demographics. What are the differences in room prices in different market segments?
  4. What percentage of bookings are canceled?
  5. Repeating guests are the guests who stay in the hotel often and are important to brand equity. What percentage of repeating guests cancel?
  6. Many guests have special requirements when booking a hotel room. Do these requirements affect booking cancellation?
Univariate Analysis

booking_status

Insights:
Booking that are not canceled are of 67.2 percent. 
Canceled are almost 33 percent.

no_of_adults

Insights:
No of Adults doesnt have outliers. 

Around 25000 bookings have no of adults 2.0

no_of_children

Insights:

More than 30000 bookings have no of children between 0 and 1.
there are few booking with 2 children.

no_of_weekend_nights

Insights:

Insights:
46.5% of the bookings have 0 weekend nights.
no_of_week_nights

Insights:

31.5% of the bookings have no of week nights as 2.

Insights:

Around 77 % of the bookings have type of meal plan as Meal Plan 1 ie., Breakfast. 
Most of the bookings are tending towards the breakfast plan.
Insights:

Almost 97 % of the bookings have no car parking space requirement.

Mostly the hotel might be in the public transport connected area or the hotel might have parking that is available and free to everyone. Special car parking space is not required for the visitors.

Insights:

Around 77 % of the bookings have Room type 1 reserved.

No booking has Room type 3.
Insights:

Lead time distribution is highly right skewed distribution with most of them @ 0.

Insights :

More than 90% of the bookings have arrival year 2018

What are the busiest months in the hotel?

arrival_month

Insights:

Oct month has maximum number of bookings with more than 5000 and around 15%.

The busiest month in the year for the hotel INN is in October, the month 10.

arrival_date

Insights:
Arrival Date ranges widely and almost uniformly distribution for the bookings.

market_segment_type

Insights:

Market segment type online contributes to around 65% of the bookings.
The next market segement type is offline. (may be in person or on call agent bookings).

repeated_guest

Insights:

Around 98% of the bookings are not repeated.

Only 2.6 % of the bookings are repeated.

no_of_previous_cancellations

Insights:

All the bookings looks to be new customers and no no of previous cancellations.

no_of_previous_bookings_not_canceled

Insights:

For almost all the bookings the no of previous bookings not canceled was 0.

avg_price_per_room

Insights:

Average Prive per room shows a almost normal distribution but with a long tail.

Insights:

Average price per room shows a almost normal distribution now.

no_of_special_requests

Insights :
Around 30% of the bookings have a special requests.

booking_status

What percentage of bookings are canceled?
Insights:

Around 67% of the bookings are not canceled.

But 30% of the bookings are canceled which is still more.

arrival_month vs booking_status

Insights :
October has the highest number of bookings and also highest number of cancellations.

Month of January has least number of cancellations. 
Insights:
Oct has the highest number of guests and its the busiest month for the hotel.
Insights:
Repeated guest and no of previous booking not canceled are having good corelation.

No of previous cancellations and no of previous bookings not cancelled are also having correlation.
lets do a pair plot to see if we can get more data on the linear relations.
Insights :
No conclusions can be drawn from the pairplot

Which market segment do most of the guests come from?

market segment type vs booking status

Insights :
market segment online has the maximum guests.
And online market segment has the maximum cancellations.
Aviation and Complementary have almost 0 cancellations.
Hotel rates are dynamic and change according to demand and customer demographics. What are the differences in room prices in different market segments?
Insights:

Online median prices are higher than the other market segment type.

And for market segment type online has a range higher for online.

Many guests have special requirements when booking a hotel room. Let's see how it impacts cancellations

no_of_special_requests vs booking_status

Insights :

Cancellations are more when there is no special request. When there are special requests there is no cancellation of the booking.

insights:
Cancellations are more when there is no special requests.
special requests vs avg price of the room
Insights :

Average price per room median is more when there is more no of special requests.

We saw earlier that there is a positive correlation between booking status and average price per room. Let's analyze it

avg price per room vs booking status

Insights:

Most of the cancellations are from the range of 100 to 200.
booking status vs leadtime
Insights :

Family members with 2 have more bookings and also cancellations.

Repeating guests are the guests who stay in the hotel often and are important to brand equity. What percentage of repeating guests cancel?

repeated guest vs booking status

Insights:

More than 20000 repeated guests bookings are not cancelled.

Data Preprocessing

There are no missing values, no treatment is required.

Outlier detection is done below.

We are not treating the outliers as its a valid data.
Data Preparation for modeling

We want to predict which bookings will be canceled.

Before we proceed to build a model, we'll have to encode categorical features.

We'll split the data into train and test to be able to evaluate the model that we build on the train data.

Building a Logistic Regression model

Model evaluation criterion

Model can make wrong predictions as:

  1. Predicting a customer will not cancel their booking but in reality, the customer will cancel their booking.
  2. Predicting a customer will cancel their booking but in reality, the customer will not cancel their booking.

Which case is more important?

How to reduce the losses?

First, let's create functions to calculate different metrics and confusion matrix so that we don't have to use the same code repeatedly for each model.

Building the model

Logistic Regression (with statsmodels library)

Observations

Negative values of the coefficient shows that probability of booking being a cancelled decreases with the increase of corresponding attribute value.

Positive values of the coefficient show that that probability of booking being a cancelled increases with the increase of corresponding attribute value.

p-value of a variable indicates if the variable is significant or not. If we consider the significance level to be 0.05 (5%), then any variable with a p-value less than 0.05 would be considered significant.

But these variables might contain multicollinearity, which will affect the p-values.

We will have to remove multicollinearity from the data to get reliable coefficients and p-values.

There are different ways of detecting (or testing) multi-collinearity, one such way is the Variation Inflation Factor.

Checking Multicollinearity

Additional Information on VIF

Variance Inflation factor: Variance inflation factors measure the inflation in the variances of the regression coefficients estimates due to collinearity that exist among the predictors. It is a measure of how much the variance of the estimated regression coefficient βk is "inflated" by the existence of correlation among the predictor variables in the model.

General Rule of thumb: If VIF is 1 then there is no correlation among the kth predictor and the remaining predictor variables, and hence the variance of β̂k is not inflated at all. Whereas if VIF exceeds 5, we say there is moderate VIF and if it is 10 or exceeding 10, it shows signs of high multi-collinearity. But the purpose of the analysis should dictate which threshold to use.

None of the variables exhibit high multicollinearity, so the values in the summary are reliable.

Let's remove the insignificant features (p-value>0.05).

Observations

The above process can also be done manually by picking one variable at a time that has a high p-value, dropping it, and building a model again. But that might be a little tedious and using a loop will be more efficient.

Now no feature has p-value greater than 0.05, so we'll consider the features in X_train1 as the final ones and lg1 as final model.

Converting coefficients to odds

The coefficients of the logistic regression model are in terms of log(odd), to find the odds we have to take the exponential of the coefficients. Therefore, odds = exp(b) The percentage change in odds is given as odds = (exp(b) - 1) * 100

Model performance evaluation

ROC-AUC

ROC-AUC on training set

Model Performance Improvement

Let's see if the recall score can be improved further, by changing the model threshold using AUC-ROC Curve.
Optimal threshold using AUC-ROC curve

Checking model performance on training set

Model is better with 0.36 threshold

f1 score isbetter with 0.36 threshold

Let's check the performance on the test set

ROC curve on test set
Using threshold 0.37
Using threshold 0.42

Insights :

Logistic model has provided the final F1 score as 70 with 0.37 threshold.

lets do a Decision Tree model to see if we can get a better F1 score.

Building a Decision Tree model

Model performance evaluation

Model can make wrong predictions as:

  1. Predicting a customer will not be behind on payments (Non-Delinquent) but in reality the customer would be behind on payments.

  2. Predicting a customer will be behind on payments (Delinquent) but in reality the customer would not be behind on payments (Non-Delinquent).

Which case is more important?

How to reduce this loss i.e need to reduce False Negatives?

First, let's create functions to calculate different metrics and confusion matrix so that we don't have to use the same code repeatedly for each model.

Checking model performance on training set

Checking model performance on test set

Before pruning the tree let's check the important features.

lets get the important features

Insights:

lead time, average price per room , market segment online, arrival date, no of special requests, arrival month, no of week nights are most important features that contribute the booking status.

Pruning the tree

Pre-Pruning

Checking performance on training set

Checking performance on test set

Cost Complexity Pruning

Next, we train a decision tree using the effective alphas. The last value in ccp_alphas is the alpha value that prunes the whole tree, leaving the tree, clfs[-1], with one node.

For the remainder, we remove the last element in clfs and ccp_alphas, because it is the trivial tree with only one node. Here we show that the number of nodes and tree depth decreases as alpha increases.

Recall vs alpha for training and testing sets

Checking model performance on test set

Lead time, average room per room, market segment type online, no of special requests, arrival date, arrival month , no of wekeend nights, no of week nights are still the important features.

Model Performance Comparison and Conclusions

Comparison of Models and Final Model Selection

Actionable Insights and Recommendations